Abstract

Background: Predictive, stable and interpretable gene signatures are generally seen as an important step
towards better personalized medicine. During the last decade various methods have been proposed for that
purpose. However, one important obstacle to making gene signatures a standard tool in clinics is the typically
low reproducibility of these signatures, combined with the difficulty of achieving a clear biological
interpretation. For this reason, recent years have seen a growing interest in approaches that try to integrate
information from molecular interaction networks.

Results: We propose a novel algorithm, called FrSVM, which integrates protein-protein interaction network
information into gene selection for prognostic biomarker discovery. Our method is a simple filter based approach,
which focuses on central genes with large differences in their expression. Compared to several other competing
methods our algorithm reveals a significantly better prediction performance and higher signature stability.
Moreover, the obtained gene lists are highly enriched with known disease genes and drug targets. We extended our
approach further by integrating information on candidate disease genes and targets of disease associated
Transcription Factors (TFs).

1 Introduction 

During the last decade the topic of "personalized medicine" has gained much attention. One of the major goals
is to identify reliable molecular biomarkers that predict a patient's response to therapy, including potential
adverse effects, in order to avoid ineffective treatment and to reduce drug side-effects and associated costs.
Prognostic or diagnostic biomarker signatures (mostly from gene expression data, but more recently also
from other data types, such as miRNA, methylation patterns or copy number alterations) have been derived
in numerous publications for various disease entities. One of the best known is a 70-gene signature for
breast cancer prognosis (MammaPrint) [1], which has gained FDA approval.

A frequently taken approach to obtain a diagnostic or prognostic gene signature is to put patients into
distinct groups and then construct a classifier that can discriminate patients in the training set and is
able to predict unseen patients well. Well known algorithms for this purpose are PAM [2], SVM-RFE [3],
Random Forests [4] or statistical tests, like SAM [5], in combination with conventional machine learning
methods (e.g. Support Vector Machines, k-NN, LDA, logistic regression, ...).

However, retrieved gene signatures are often not reproducible in the sense that inclusion or exclusion of
a few patients can lead to quite different sets of selected genes. Moreover, these sets are often difficult to
interpret in a biological way [6]. For that reason, a number of approaches have been proposed more recently,
which try to integrate knowledge on canonical pathways, GO annotation or protein-protein interactions into
gene selection algorithms [7-15]. A review of these and other methods can be found in Cun and Frohlich [16].
The general hope is not only to make biomarker signatures more stable, but also more interpretable in a
biological sense. This is seen as a key to making gene signatures a standard tool in clinical diagnosis [17].



In this paper we propose a simple and effective filter based gene selection mechanism, which employs the
GeneRank algorithm [18] to rank genes according to their centrality in a protein-protein interaction (PPI)
network and their (differential) gene expression. It has been shown previously that deregulated central genes
have a strong association with the disease pathology in cancer [19]. Our method uses the span rule [20]
as a bound on the leave-one-out error of Support Vector Machines (SVMs) to filter the top ranked genes
and construct a classifier. It is thus conceptually and computationally much simpler than our previously
proposed RRFE algorithm [15], which used a reweighting strategy of the SVM decision hyperplane. We
demonstrate here that our novel method, called FrSVM, not only significantly outperforms RRFE, PAM,
network based SVMs [14], pathway activity classification [11] and average pathway expression [10], but that
it also yields extremely reproducible gene signatures.

In a second step we investigate to what extent our approach can be improved further by incorporating
potential disease genes or targets of transcription factors, which were previously found to be enriched in
known disease genes. It turns out that the combination with candidate disease genes can further improve
the association to biological knowledge.

2 Methods

2.1 Datasets 

We retrieved two breast cancer [21, 22] and one prostate cancer [23] dataset from the NCBI GEO data
repository [24]. Moreover, TCGA [25] was used to obtain an additional dataset for ovarian cancer (normalized
level 3 data). All data were measured on Affymetrix HGU133 microarrays (22,283 probesets). Normalization
was carried out via FARMS (breast cancer datasets, [26]) and quantile normalization (prostate cancer dataset,
[27]), respectively. As clinical end points we considered metastasis free (breast cancer) and relapse free
(ovarian cancer) survival time after initial clinical treatment. For ovarian cancer only tumors with stages IIA
to IV and grades G2 and G3 were considered, which after resection revealed at most 10 mm residual cancer
tissue and responded completely to initial chemotherapy.

Survival time information was dichotomized into two classes according to whether or not patients suffered
from a reported relapse / metastasis event within 5 years (breast) and 1 year (ovarian), respectively. Patients
with a survival time shorter than 5 years / 1 year without any reported event were not considered and were
removed from our datasets. For prostate cancer we employed the class information provided by [23]. A summary
of our datasets can be found in Table 1.
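
The dichotomization rule described above can be sketched as follows. This is only an illustrative sketch: the function name `dichotomize`, its arguments and the handling of the exact boundary (an event at exactly the horizon counts as "within") are assumptions, not taken from the original analysis.

```python
def dichotomize(time_years, event, horizon):
    """Map a survival record to a class label.

    time_years: follow-up or event time in years
    event:      True if a relapse / metastasis event was reported
    horizon:    5 for breast cancer, 1 for ovarian cancer (as in the text)

    Returns 1 (event within the horizon), 0 (no event within the horizon),
    or None (follow-up shorter than the horizon without event: excluded).
    """
    if event and time_years <= horizon:  # boundary handling is an assumption
        return 1
    if time_years >= horizon:
        return 0
    return None  # censored before the horizon, removed from the dataset


# Hypothetical toy records: (time in years, event reported?)
records = [(2.0, True), (7.5, False), (3.0, False), (6.0, True)]
labels = [dichotomize(t, e, horizon=5) for t, e in records]
kept = [(r, y) for r, y in zip(records, labels) if y is not None]
```

Records that map to `None` correspond to the patients that were removed from the datasets.
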

2.2 Protein-Protein Interaction (PPI) Network 

A protein interaction network was compiled from a merger of all non-metabolic KEGG pathways [28] (only
gene-gene interactions were considered) together with the Pathway Commons database [29], which was
downloaded in tab-delimited format (May 2010). The purpose was to obtain a network of known protein
interactions that is as comprehensive as possible. For the Pathway Commons database the SIF interactions
INTERACTS_WITH and STATE_CHANGE were taken into account (see footnote 1) and any self loops were removed.
For retrieval and merger of KEGG pathways, we employed the R package KEGGgraph [30]. The resulting
network graph (13,840 nodes with 397,454 edges) contained directed as well as undirected edges. For example,
a directed edge A -> B could indicate that protein A modifies protein B (e.g. via phosphorylation). An
undirected edge A - B implies a not further specified type of direct interaction between A and B. Nodes in
this network were identified via Entrez gene IDs.



The R package hgu133a.db [31] was employed to map probe sets on the microarray to nodes in the PPI
network. This resulted in a protein-protein interaction network matrix of dimension 8876 x 8876, because
several probe sets can map to the same protein in the PPI network. Accordingly, expression values for
probesets on the microarray that mapped to the same gene in the network were averaged. Probesets that
could not be mapped to the PPI network were ignored for all network based approaches except for RRFE,
which, according to Johannes et al [15], assigns a minimal gene rank to them.

Footnote 1: http://www.pathwaycommons.org/pc/sif_interaction_rules.do



2.3 Gene Selection with PPI Information (FrSVM) 

The GeneRank algorithm described in Morrison et al [18] is an adaption of Google's PageRank algorithm. It 
combines gene expression and protein-protein interaction information to obtain a ranking of genes by solving 
the linear equation system 

$$(I - d\, W D^{-1})\, r = (1 - d)\, e \qquad (1)$$

where W denotes the adjacency matrix of the PPI network, D is a diagonal matrix consisting of the node
degrees, r is the resulting vector of gene ranks and d is a damping factor weighting (differential) gene
expression e against network information. As suggested in Morrison et al [18] we set d = 0.85 here. The
general idea of the algorithm is to give preference to proteins that are central in the network (similar to
web pages with many links) on the one hand and have a high difference in their expression on the other hand.
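
To make equation (1) concrete, the following minimal sketch solves the linear system with NumPy on a toy network. It assumes a symmetric (undirected) adjacency matrix, so that row and column degrees coincide, and it sets the degree of isolated nodes to 1 to avoid division by zero. Both choices, as well as the function name `gene_rank`, are assumptions of this illustration and not necessarily those of the original implementation.

```python
import numpy as np


def gene_rank(W, e, d=0.85):
    """Solve (I - d * W * D^{-1}) r = (1 - d) * e for the rank vector r.

    W: (n, n) adjacency matrix, assumed symmetric here
    e: (n,) non-negative expression scores (e.g. absolute t-statistics)
    d: damping factor, 0.85 as in the text
    """
    W = np.asarray(W, dtype=float)
    e = np.asarray(e, dtype=float)
    degree = W.sum(axis=0)
    degree[degree == 0] = 1.0  # assumption: isolated nodes keep only their expression term
    # W / degree divides column j by degree[j], i.e. it computes W D^{-1}
    A = np.eye(W.shape[0]) - d * (W / degree)
    return np.linalg.solve(A, (1.0 - d) * e)


# Toy path graph 0 - 1 - 2 with identical expression scores
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
e = np.ones(3)
r = gene_rank(W, e)
```

In this toy example all three nodes have the same expression score, so the central node 1 receives the largest rank purely because of its position in the network. With d = 0 the network is ignored and r equals e.
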

As a score for differential gene expression (vector e) we employed the absolute value of t-statistics here.
That means we conducted a t-test for each probeset and then used the absolute t-value to assign
weights to nodes in the PPI network. This in turn allowed us to apply GeneRank to calculate a rank for
each probeset. We then filtered the top ranked 10, 11, ..., 30% of all probesets mapping to our PPI network
and each time trained a Support Vector Machine (SVM). We used the span rule [20] to estimate an upper
bound on the leave-one-out error in a computationally efficient way. This was only done on the training
data and allowed us to select the best cutoff value for our filter. At the same time we could also use the span
rule to tune the soft margin parameter C of the SVM in the range $10^{-3}, 10^{-2}, \ldots, 10^{3}$. Our
approach is called FrSVM in the following.
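
The scoring and filtering steps described above can be sketched as follows. The text does not state which variant of the t-test was used; the sketch assumes Welch's unequal-variance statistic. The helper names `abs_t_statistics` and `top_fraction` are hypothetical, and the model selection via the span rule is deliberately not shown.

```python
import numpy as np


def abs_t_statistics(X, y):
    """Absolute two-sample t-statistic per probeset (Welch variant, an assumption).

    X: (samples, probesets) expression matrix
    y: binary class labels (0 / 1), one per sample
    Probesets with zero variance in both classes would need special handling.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    a, b = X[y == 1], X[y == 0]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / se


def top_fraction(scores, fraction):
    """Indices of the top `fraction` of probesets, ordered by descending score."""
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(round(fraction * len(scores))))
    return np.argsort(-scores)[:k]


# Candidate filter cutoffs 10%, 11%, ..., 30% as described in the text
cutoffs = [p / 100 for p in range(10, 31)]

# Toy data: 6 samples, 2 probesets; only probeset 0 differs between classes
X = np.array([[5.0, 1.0], [6.0, 2.0], [5.5, 1.5],
              [1.0, 1.2], [2.0, 1.8], [1.5, 1.4]])
y = np.array([1, 1, 1, 0, 0, 0])
t = abs_t_statistics(X, y)
```

In the full method the scores `t` would enter GeneRank as the vector e, and `top_fraction` would be applied to the resulting ranks once per cutoff, each time followed by SVM training on the retained probesets.
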

2.4 Using Candidate Disease Genes 

For many diseases several associated genes are known. Based on this information it is possible to prioritize
candidate genes via their similarity to known disease genes: Schlicker et al [32] proposed a mechanism to
compute similarities of gene products to candidate genes based on their Gene Ontology (GO) information.
The Endeavour software [33] employs a different algorithm to rank candidate genes based on their proximity
in annotation space by combining information sources like GO, KEGG, text and others.

We tested a combination of the proposed FrSVM algorithm with both disease gene prioritization
approaches (Endeavour and GO similarity): We selected the top ranked p% genes according to FrSVM as
well as according to Endeavour and GO similarity, respectively. The union of both sets was then used for
SVM training. For Endeavour we considered GO, KEGG, text and sequence motifs as information sources.
Information on disease related genes was obtained from the DO-light ontology [34]. GO functional similarity
was computed via the method proposed in [35] using the web tool FunSimMat [36], which uses the NCBI OMIM
database for disease gene annotation. The combination of FrSVM with Endeavour is called FrSVM_EN, and the
combination with functional GO similarities is called FrSVM_FunSim accordingly.

In addition to FrSVM_EN and FrSVM_FunSim we also considered using the top ranked candidate
disease genes only (without any further network information). The corresponding approaches are in principle
equivalent to FrSVM from the methodological point of view (just another ranking is used) and are called
EN and FunSim, respectively.

2.5 Using Targets of Enriched Transcription Factors 

Transcription factors (TFs) are a major factor influencing gene expression. We performed a hypergeometric
test to look for enriched TF targets in disease associated genes (FDR cutoff 5%). Only probesets mapping to
targets of enriched TFs were then taken into account to conduct a subsequent FrSVM training. We refer to
this method as FrSVM_TF. Again, information on disease relation of genes was obtained from the DO-light
ontology. A TF-target gene network was compiled by computing TF binding affinities to promoter sequences
of all human genes according to the TRAP model [37] via the author's R implementation. Upstream
sequences of genes were retrieved here from the ENSEMBL database via biomaRt [38]. We assumed that
promoter sequences were located within 2 kbp upstream of the transcription start site of a gene. As
trustworthy TF targets we considered those for which a Holm corrected affinity p-value smaller than 0.01
was reported. In total we found 6334, 8196 and 5866 probesets (having enriched binding sites of 33,
35 and 24 TFs) for breast, prostate and ovarian cancer, respectively.
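
The hypergeometric enrichment test mentioned above asks how likely it is to observe at least k disease associated genes among the n targets of a TF, given K disease genes among N genes in total. A minimal sketch using only the Python standard library is shown below. The function name and the toy numbers are assumptions for illustration, and the subsequent FDR correction across TFs is not shown.

```python
from math import comb


def hypergeom_enrichment_p(N, K, n, k):
    """Upper-tail p-value P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: number of genes in the background
    K: number of disease associated genes in the background
    n: number of target genes of the transcription factor
    k: number of targets that are disease associated
    """
    upper = min(K, n)
    # math.comb returns 0 when more items are chosen than available,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical toy numbers: 10 genes, 4 disease genes, TF with 3 targets, 2 of them disease genes
p = hypergeom_enrichment_p(N=10, K=4, n=3, k=2)
```

For the toy numbers the sum evaluates to (36 + 4) / 120 = 1/3, which can be checked by hand.
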

2.6 Classification Performance, Signature Stability and Biological Interpretability 

In order to assess the prediction performance we performed a 10 times repeated 10-fold cross-validation on
each dataset. That means the whole data was randomly split into 10 folds, and each fold was sequentially left
out once for testing, while the rest of the data was used for training and optimizing the classifier (including
gene selection, hyper-parameter tuning, standardization of expression values for each gene to mean 0 and
standard deviation 1, etc.). The whole process was repeated 10 times. It should be noted that
standardization of gene expression data was also only done on each training set separately and the corresponding
scaling parameters were then applied to the test data. The area under the receiver operating characteristic
curve (AUC) was used here to measure the prediction accuracy, and the AUC was calculated with the R package
ROCR [39].
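
The following sketch illustrates the two points that matter most in this protocol: the repeated random splitting into folds and the estimation of scaling parameters on the training part only. It is a simplified illustration, not the original R code; the function names are hypothetical, the split is not stratified by class (the text does not say whether it was), and gene selection and SVM training are omitted.

```python
import numpy as np


def repeated_kfold_indices(n, k=10, repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for a `repeats` times repeated k-fold CV."""
    rng = np.random.default_rng(seed)
    for _ in range(repeats):
        perm = rng.permutation(n)
        for fold in np.array_split(perm, k):
            yield np.setdiff1d(perm, fold), fold


def standardize_train_test(X_train, X_test):
    """Scale each gene to mean 0 / sd 1 using training data only."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # assumption: leave constant genes unscaled
    return (X_train - mu) / sd, (X_test - mu) / sd


# Toy expression matrix: 23 samples, 4 genes
rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(23, 4))
for train_idx, test_idx in repeated_kfold_indices(len(X)):
    X_tr, X_te = standardize_train_test(X[train_idx], X[test_idx])
    # gene selection, SVM training and AUC computation would follow here
```

Because the mean and standard deviation are computed from `X_train` alone, no information from the held-out fold leaks into the preprocessing.
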
To assess the stability of feature selection methods, we computed the selection frequency of each
gene within the 10 times repeated 10-fold cross-validation procedure. In an ideal case probesets would be
selected consistently, i.e. all probesets would be chosen 100 times. The more the probeset selection profile
(which is essentially a histogram) resembles this ideal case the better. In order to capture this behavior
numerically we defined a so-called stability index (SI) as

$$\mathrm{SI} = \sum_{i \in \{10, 20, \ldots, 100\}} i \cdot f(i)$$

where $f(i)$ denotes the fraction of probesets that have been selected more than $i - 10$ and at most $i$
times. Please note that $\sum_i f(i) = 1$. SI represents a weighted histogram count of selection frequencies.
Obviously, the larger SI, the more stable the algorithm is. In the optimal case SI = 100.
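
As an illustration, the stability index can be computed from per-probeset selection counts as sketched below. The sketch assumes that only probesets selected at least once enter the histogram (otherwise the fractions f(i) would not sum to 1); this reading and the function name `stability_index` are assumptions of this example.

```python
import numpy as np


def stability_index(selection_counts, runs=100, bin_width=10):
    """SI = sum over i in {10, 20, ..., runs} of i * f(i).

    selection_counts: how often each probeset was selected across all CV runs
    f(i): fraction of selected probesets with count in (i - bin_width, i]
    Requires at least one probeset with a positive count.
    """
    counts = np.asarray(selection_counts, dtype=float)
    counts = counts[counts > 0]  # assumption: never-selected probesets are ignored
    upper = np.arange(bin_width, runs + 1, bin_width)  # 10, 20, ..., 100
    # A count c falls into the bin with upper edge ceil(c / bin_width) * bin_width
    edges = np.ceil(counts / bin_width) * bin_width
    f = np.array([(edges == i).mean() for i in upper])
    return float((upper * f).sum())


# Hypothetical selection counts: half of the probesets always selected, half rarely
si = stability_index([100, 100, 10, 10])
```

For the toy counts half of the probesets fall into the bin with upper edge 100 and half into the bin with upper edge 10, giving SI = 0.5 * 100 + 0.5 * 10 = 55. A method that selects the same probesets in all 100 runs reaches the optimum of 100.
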

We also examined to what extent signatures obtained by training the classifier on the whole dataset could be
related to existing biological knowledge. For this purpose we looked for enriched disease related genes and
known targets of therapeutic compounds via a hypergeometric test. For disease related genes we made use
of the tool "FunDO" [34]. Multiple testing correction was done here via Bonferroni's method. The list of
therapeutic compounds and their known targets was retrieved via the software MetaCore™ (GeneGo Inc.)
and is available in the supplements.

3 Results and Discussion 

3.1 FrSVM improves Classification Performance and Signature Stability 

We compared the prediction performance of our proposed FrSVM method to PAM [2], average gene
expression of KEGG pathways (aveExpPath, [10]), pathway activity classification (PAC, [11]), network-based
SVM (networkSVM, [14]) and reweighted recursive feature elimination (RRFE, [15]). For aveExpPath we
first conducted a global test [40] to select pathways being significantly associated with the phenotype (FDR
cutoff 1%) and then computed the mean expression of genes in these pathways.

Initially we only used PPI information for our FrSVM approach and found a clear improvement of AUC
values for FrSVM compared to all other tested methods (Figure 1). This visual impression was confirmed
via a two-way ANOVA (using method, dataset as well as their interaction term as factors) with
Tukey's post-hoc test, which revealed a significantly increased AUC for FrSVM with p < 1e-6 in all cases.

We further inspected the frequencies with which individual probesets were selected by each of the tested
methods (Figure 2) as well as the stability indices (Figure 2b). This analysis showed that FrSVM selected
probesets in a very stable manner (only comparable to networkSVM). The fraction of consistently selected
probesets ranged from ~40% (ovarian cancer) to ~70% (Schmidt et al. breast cancer dataset). Interestingly,
these consistently selected genes typically showed a highly significant differential expression, which was
assessed via SAM [5] here. For example, 60% of all consistently selected probesets in the Schmidt et al.
dataset had a q-value < 5%. This illustrates the tendency of FrSVM to focus on genes that show large
differences in their expression between the two compared groups and are central in the PPI network.

3.2 Clear Association to Biological Knowledge 

We trained each of the tested methods on complete datasets to retrieve final signatures, which we tested
subsequently for the enrichment of disease related genes and known drug targets (Figure 3 and Figure 4).
This analysis showed that FrSVM derived signatures can be clearly associated with biological knowledge. The
degree of enrichment was only comparable with aveExpPath and RRFE, which have previously been found
to yield clearly interpretable signatures [41].

3.3 PPI Network Integration Helps Most 

We went on to test how much the performance of FrSVM would be affected by integrating candidate disease
genes or restricting selectable probesets to targets of enriched TFs. Generally, incorporation of network
knowledge appeared to yield a better prediction performance than only using candidate disease genes (Figure
5, p < 0.01, two-way ANOVA with Tukey's post-hoc test). No significant benefit of additionally integrating
candidate disease genes or targets of enriched TFs into FrSVM could be observed in terms of AUC values
or signature stabilities (Figure 5b). However, FrSVM_EN showed a clearer association to disease genes than
FrSVM (Figure 6). This is not surprising, because the method explicitly integrates the top ranked candidate
disease genes.

4 Conclusion 

We proposed a simple and effective filter based algorithm to integrate PPI network information into
prognostic or diagnostic biomarker discovery based on a modification of Google's PageRank algorithm. The
method favors genes that on the one hand show a large difference in their expression (high absolute t-score)
and on the other hand are central in the network. It has been shown previously that such genes are often
associated with the disease phenotype [19]. Our approach significantly outperformed several other
classification algorithms in terms of prediction performance and signature stability on four datasets.
Moreover, it yielded signatures showing a very clear relation to existing biological knowledge. Additional
integration of potential disease genes could further enhance this association, but nonetheless did not improve
prediction performance or signature stability. PPI network integration appeared to be more effective than
integration of candidate disease genes. Using only targets of TFs, which were previously found to be enriched
in known disease genes, did not reveal any significant improvement. However, from a computational point of
view this approach might still be interesting, because the set of candidate probesets is significantly
restricted before any time consuming machine learning algorithm is applied.

In conclusion, our method offers a computationally cheap and effective mechanism to include prior
knowledge into gene selection for biomarker discovery.

Figures

Figure 1 - Prediction performance of FrSVM in comparison to other methods in terms of area under
ROC curve (AUC).
Figure 2 - Fraction of probesets that were selected 1 - 10, 11 - 20, ..., 91 - 100 times within the 10
times repeated 10-fold CV procedure.







Figure 3 - Stability indices (SI) of compared methods.
Figure 4 - Enrichment of signatures with disease related genes.
[Figure panels: BC (Schmidt et al.), BC (Ivshina et al.), Prostate cancer, Ovarian cancer. Each panel shows
-log10(p-value) for the methods PAM, aveExpPath, PAC, networkSVM, RRFE and FrSVM, for the disease terms
Cancer, Primary tumor and the respective cancer type.]



Figure 5 - Enrichment of signatures with known drug targets.
Figure 6 - Effect of integrating prior information in addition to protein interactions into FrSVM:
prediction performance.
Figure 7 - Effect of integrating prior information in addition to protein interactions into FrSVM:
stability index.
Figure 8 - Enrichment of signatures with disease related genes after integration of prior information
in addition to protein interactions.

Tables

Table 1 - Overview of the employed datasets

GEO ID     Examples   Cancer type       Predicted label                         Positives   Data source
GSE11121   182        Breast cancer     DFS smaller than 5 years vs. 5 years    28          Schmidt et al [22]
GSE4922    228        Breast cancer     DFS smaller than 5 years vs. 5 years    69          Ivshina et al [21]
TCGA       135        Ovarian cancer    relapse free survival > 1y              35          TCGA [25]
GSE25136   79         Prostate cancer   Recurrent vs. Non-Recurrent             40          Sun et al [23]